HappyDB is a corpus of 100,000 crowd-sourced happy moments collected via Amazon’s Mechanical Turk. You can read more about it at https://arxiv.org/abs/1801.07746.

Here, we explore this data set and try to answer the question, “What makes people happy?”

Step 0 - Load all the required libraries

We install and load the packages used throughout the analysis:

devtools::install_github("lchiffon/wordcloud2")
library(tidyverse)
library(tidytext)
library(DT)
library(scales)
library(gridExtra)
library(ngram)
library(ggplot2)
library(wordcloud2)
library(tidyr)
library(reshape2)
library(tm)
library(topicmodels)

Step 1 - Load the processed text data along with demographic information on contributors

We use the processed data for our analysis and combine it with the demographic information available.

hm_data <- read_csv("../output/processed_moments.csv")

urlfile<-'https://raw.githubusercontent.com/rit-public/HappyDB/master/happydb/data/demographic.csv'
demo_data <- read_csv(urlfile)

Combine both the data sets and keep the required columns for analysis

We select a subset of the data that satisfies specific row conditions.

hm_data <- hm_data %>%
  inner_join(demo_data, by = "wid") %>%
  select(wid,
         original_hm,
         cleaned_hm,
         gender, 
         marital, 
         parenthood,
         reflection_period,
         age, 
         country, 
         ground_truth_category, 
         text) %>%
  mutate(count = sapply(text, wordcount)) %>%   # count words in the piped data, not the original hm_data
  filter(gender %in% c("m", "f")) %>%
  filter(marital %in% c("single", "married")) %>%
  filter(parenthood %in% c("n", "y")) %>%
  filter(reflection_period %in% c("24h", "3m")) %>%
  mutate(reflection_period = fct_recode(reflection_period,    #Change factor levels by hand
                                        months_3 = "3m", hours_24 = "24h"))

head(hm_data)
## # A tibble: 6 x 12
##     wid original_hm  cleaned_hm gender marital parenthood reflection_peri…
##   <int> <chr>        <chr>      <chr>  <chr>   <chr>      <fct>           
## 1  2053 I went on a… I went on… m      single  n          hours_24        
## 2     2 I was happy… I was hap… m      married y          hours_24        
## 3  1936 I went to t… I went to… f      married y          hours_24        
## 4   206 We had a se… We had a … f      married n          hours_24        
## 5    45 I meditated… I meditat… m      single  n          hours_24        
## 6   195 I made a ne… I made a … m      single  n          hours_24        
## # ... with 5 more variables: age <chr>, country <chr>,
## #   ground_truth_category <chr>, text <chr>, count <int>
datatable(hm_data)
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html

Step 2 - Exploratory Data Analysis on demographic distribution

Country vs Gender distribution

Although the dataset covers contributors from more than 30 countries, the vast majority are from the USA and India, so we restrict the visualization to these two countries.

df <- hm_data%>%select(gender, marital,parenthood, country,age) %>% 
      filter(country %in% c("USA", "IND")) %>% 
      group_by(country, gender) %>%  summarise(n = n())

      

ggplot(df,aes(x = country, y = n,fill = gender))+
   geom_bar(stat="identity",position = "dodge")

Contributors from the USA and India make up most of the population, nearly 95.27%. The gender composition is imbalanced, however, with more males than females in both countries.
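The 95.27% figure can be checked directly from the country counts (a quick sketch reusing the `hm_data` tibble built in Step 1):

```r
# Proportion of contributors by country (sketch)
country_share <- hm_data %>%
  count(country) %>%
  mutate(prop = n / sum(n)) %>%
  arrange(desc(prop))

# Combined share of the two largest countries
sum(country_share$prop[country_share$country %in% c("USA", "IND")])
```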

Marital vs Parenthood

Although a small fraction of the population is separated, divorced, or widowed, for the same reason we only take the majority of the population into account: single and married.

df <- hm_data%>%select(gender, marital,parenthood) %>% 
      filter(parenthood %in% c("y", "n")) %>%
      filter(marital %in% c("single", "married")) %>%
      group_by(marital, parenthood) %>%  summarise(n = n())

ggplot(df,aes(x = marital, y = n,fill = parenthood))+
   geom_bar(stat="identity",position = "dodge")

Looking at marital status, singles slightly outnumber the married. In particular, the single-without-children and married-with-children groups together make up nearly 83.79% of the total population.
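The 83.79% figure can likewise be verified with a quick proportion table (a sketch, again reusing `hm_data` from Step 1):

```r
# Proportions of marital/parenthood combinations (sketch)
status_share <- hm_data %>%
  count(marital, parenthood) %>%
  mutate(prop = n / sum(n))

# Combined share of single-without-children and married-with-children
sum(status_share$prop[(status_share$marital == "single"  & status_share$parenthood == "n") |
                      (status_share$marital == "married" & status_share$parenthood == "y")])
```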

Step 3 - Segment population to compare different emotional categories between profiles

We segment the population into 8 profiles using the gender, marital, and parenthood features, then compare the emotional categories across these profiles.

df_family <- hm_data%>%select(gender, marital,parenthood,ground_truth_category) %>% 
      filter(!is.na(ground_truth_category)) %>%
      unite(status, gender, marital,parenthood, sep  = "_" ) %>%
      group_by(status, ground_truth_category) %>% 
      summarise(n = n())%>% mutate(proportion = n / sum(n))

ggplot(df_family,aes(x = status, y = proportion,fill = ground_truth_category))+
   geom_bar(stat="identity",position = "dodge")

As the bar chart shows, affection and achievement are the two dominant categories of happy feelings: most people remembered happy times when they experienced affection or achievement. We can also compare across profiles. For the majority of profiles affection dominates, but for single males, with or without children, happy moments most often came from achievement. Compared with other profiles, the married without children derived a notably larger share of their happy moments from leisure.
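These per-profile observations can also be read off programmatically by extracting the top category for each profile (a sketch using the `df_family` table built above; `slice_max()` requires dplyr >= 1.0):

```r
# Most frequent emotional category within each of the 8 profiles (sketch)
df_family %>%
  group_by(status) %>%
  slice_max(proportion, n = 1)
```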

Step 4 - Exploratory Data Analysis with Word Clouds

4.1 Total text exploration using single words

Split the text into individual words, creating a bag of words from the text data.

bag_of_words <-  hm_data %>%
  unnest_tokens(word, text)   #Split a column into tokens using the tokenizers package


word_count <- bag_of_words %>%  # high-frequency word list
  count(word, sort = TRUE)

#select top 200
word_count <- word_count[1:200,]

Delete meaningless words to make the word cloud more insightful.

set.seed(1234)
#delete meaningless words like "day, time"
word_count <- word_count[c(-2,-3),]
#wordcloud(word_count$word,word_count$n, min.freq =1000, colors=rainbow(15))
wordcloud2(word_count,size =0.5)

Judging by single words, happy moments largely derive from friendship. Family (wife, husband, son, daughter, mom, brother, sister), birthdays, jobs, and games also provide happy memories.

Plot a bar chart to visualize the top 10 words.

ggplot(head(word_count,10), aes(reorder(word,n), n, fill = 1:10))+
  geom_bar(stat = "identity")+coord_flip() +
  xlab("Single Word")+ylab("Frequency")+
  ggtitle("Most Frequent Word")

What made you happy today? Judging by single words, most people point to friendship, followed by family and home.

However, some things that spark happy moments cannot be expressed in a single word, so single-word analysis misses important information. We therefore try bigrams.

4.2 Total text exploration using bigrams

hm_bigrams <- hm_data %>%
  filter(count != 1) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

bigram_counts <- hm_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>% # two-word list
  count(word1, word2, sort = TRUE)

#library(tidyr)
word_count_bigrams <- unite(bigram_counts, word, word1, word2, sep =" ")
word_count_bigrams <- word_count_bigrams[1:100,]
wordcloud2(word_count_bigrams,size =0.3)

What made you happy today? Using bigrams, spending time together, video games, eating ice cream, watching movies, Mother’s Day, and birthday parties made people happiest. Moreover, favorite restaurants, buying a car, reading a book, spending time with people, and grocery stores also provided happy memories.

Plot a bar chart to visualize the top 10 bigrams.

ggplot(head(word_count_bigrams,10), aes(reorder(word,n), n, fill = 1:10)) +
  geom_bar(stat = "identity") + coord_flip() +
  xlab("Bigrams") + ylab("Frequency") +
  ggtitle("Most Frequent bigrams")

4.3 3 months vs 24 hours on word clouds

We display the difference in word clouds, taking the reflection period into account, using both single words and bigrams.

bag_of_words_3 <-  hm_data %>%
  filter(reflection_period %in% c("months_3") )%>%
  unnest_tokens(word, text)   #Split a column into tokens using the tokenizers package


word_count_3 <- bag_of_words_3 %>%  # high-frequency word list
  count(word, sort = TRUE)

#select top 200
word_count_3 <- word_count_3[1:200,]


bag_of_words_24 <-  hm_data %>%
  filter(reflection_period %in% c("hours_24") )%>%
  unnest_tokens(word, text)   #Split a column into tokens using the tokenizers package


word_count_24 <- bag_of_words_24 %>%  # high-frequency word list
  count(word, sort = TRUE)

#select top 200
word_count_24 <- word_count_24[1:200,]

#plot wordcloud
wordcloud2(word_count_3,size =0.5)
wordcloud2(word_count_24,size =0.5)

Comparing the two word clouds, family, job, and birthday are frequent in happy moments from the past 3 months, while words like morning and dinner are more popular in the past 24 hours.
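One way to make this comparison concrete is to look at the words that appear in one top-200 list but not the other (a sketch reusing the `word_count_3` and `word_count_24` tables built above):

```r
# Words distinctive to each reflection period (sketch)
only_3m  <- anti_join(word_count_3,  word_count_24, by = "word")
only_24h <- anti_join(word_count_24, word_count_3,  by = "word")

head(only_3m$word, 10)   # frequent over 3 months but absent from the 24-hour list
head(only_24h$word, 10)  # frequent over 24 hours but absent from the 3-month list
```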

hm_bigrams_3 <- hm_data %>%
  filter(count != 1) %>%
  filter(reflection_period %in% c("months_3") )%>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

bigram_counts_3 <- hm_bigrams_3 %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>% # two-word list
  count(word1, word2, sort = TRUE)

#library(tidyr)
word_count_bigrams_3 <- unite(bigram_counts_3, word, word1, word2, sep =" ")
word_count_bigrams_3 <- word_count_bigrams_3[1:100,]
wordcloud2(word_count_bigrams_3,size =0.3)

Using bigrams to analyze happy moments from the past 3 months, “mother day” is the most frequent bigram. Over this period, saving money, losing pounds, visiting friends, “couple weeks”, and college graduation also created happy moments.

hm_bigrams_24 <- hm_data %>%
  filter(count != 1) %>%
  filter(reflection_period %in% c("hours_24") )%>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

bigram_counts_24 <- hm_bigrams_24 %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>% # two-word list
  count(word1, word2, sort = TRUE)

#library(tidyr)
word_count_bigrams_24 <- unite(bigram_counts_24, word, word1, word2, sep =" ")
word_count_bigrams_24 <- word_count_bigrams_24[1:100,]
wordcloud2(word_count_bigrams_24,size =0.3)

Using bigrams to analyze happy moments from the past 24 hours, video games, watching movies, and talking with friends are mentioned by most people, although these little things fade away quickly. The word cloud also shows that small pleasures like reading a book, walking the dog, a phone call, and a good night’s sleep made people happy.

Step 5 - Topic Modeling

Build corpus

set.seed(1)
hm_data <- hm_data[sample(1:nrow(hm_data), 20000),]
# build a corpus from the text of the happy moments
corpus <- VCorpus(VectorSource(hm_data$text))

Model calculation

DTM <- DocumentTermMatrix(corpus)
dim(DTM)
## [1] 20000  9017
# preprocessing left some documents empty; LDA cannot handle empty rows, so remove them
rowTotals <- apply(DTM , 1, sum)
DTM  <- DTM[rowTotals> 0, ]

The topic model inference results in two (approximate) posterior probability distributions: a distribution theta over K topics within each document and a distribution beta over V terms within each topic, where V represents the length of the vocabulary of the collection (V = 9017). Let’s take a closer look at these results:

#Number of topics
k <- 7
                                            
#Run LDA using Gibbs sampling
topicModel <- LDA(DTM, k, method="Gibbs", control=list(iter = 500, seed = 1234))
terms(topicModel, 10)
##       Topic 1     Topic 2    Topic 3    Topic 4     Topic 5     
##  [1,] "feel"      "watched"  "home"     "day"       "finally"   
##  [2,] "happiness" "played"   "dinner"   "moment"    "job"       
##  [3,] "live"      "night"    "nice"     "time"      "husband"   
##  [4,] "helped"    "game"     "daughter" "event"     "love"      
##  [5,] "makes"     "favorite" "wife"     "life"      "house"     
##  [6,] "didnt"     "dog"      "sister"   "talked"    "received"  
##  [7,] "sleep"     "son"      "mother"   "finished"  "hours"     
##  [8,] "people"    "movie"    "lunch"    "parents"   "girlfriend"
##  [9,] "person"    "won"      "food"     "book"      "called"    
## [10,] "started"   "fun"      "eat"      "completed" "told"      
##       Topic 6    Topic 7   
##  [1,] "found"    "friend"  
##  [2,] "bought"   "time"    
##  [3,] "car"      "family"  
##  [4,] "money"    "day"     
##  [5,] "shopping" "enjoyed" 
##  [6,] "gift"     "birthday"
##  [7,] "free"     "school"  
##  [8,] "buy"      "visit"   
##  [9,] "cat"      "surprise"
## [10,] "morning"  "brother"
# have a look at some of the results (posterior distributions)
tmResult <- posterior(topicModel)
# format of the resulting object
attributes(tmResult)
## $names
## [1] "terms"  "topics"
# topics are probability distributions over the entire vocabulary
beta <- tmResult$terms   # get beta from results
dim(beta)                # K distributions over nTerms(DTM) terms
## [1]    7 9017
nDocs(DTM)   
## [1] 19999
theta <- tmResult$topics 
dim(theta)    
## [1] 19999     7
topicNames <- c("affection", "enjoy the moment", "exercise", "leisure", "achievement", "nature", "bonding")

Based on their top words, we name these 7 topics “affection”, “enjoy the moment”, “exercise”, “leisure”, “achievement”, “nature”, and “bonding”.

Visualization of Topics

Then, we select 3 examples to visualize topics.

exampleIds <- c(1, 800, 1200)
lapply(corpus[exampleIds], as.character)
## $`1`
## [1] "sister bought android cell phone met phone surprise"
## 
## $`800`
## [1] "event women conference empower women elected form ecstatic"
## 
## $`1200`
## [1] "cold winter city day beautiful spring weather"
N <- length(exampleIds)
# get topic proportions from example documents
topicProportionExamples <- theta[exampleIds,]
colnames(topicProportionExamples) <- topicNames
vizDataFrame <- melt(cbind(data.frame(topicProportionExamples), document = factor(1:N)), variable.name = "topic", id.vars = "document")  

ggplot(data = vizDataFrame, aes(topic, value, fill = document)) + 
  ylab("proportion") +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +  
  coord_flip() +
  facet_wrap(~ document, ncol = N)

From the bar chart, we can see the topic distributions within each text. A longer bar implies a higher probability for that topic.
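Beyond the three examples, theta can be summarized across the whole collection by labeling each document with its most probable topic (a sketch using the `theta` matrix and `topicNames` vector defined above):

```r
# Assign each document its highest-probability topic and tabulate (sketch)
dominant_topic <- topicNames[apply(theta, 1, which.max)]
table(dominant_topic)
```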